Towards the Automatic Construction of Conceptual Taxonomies
Abstract
In this paper we investigate the automatic construction of conceptual taxonomies and evaluate the achievable results. Here, a concept is represented by a keyword extracted from a text corpus. A conceptual taxonomy is then a hierarchical organization of the keywords (Keyword Hierarchy, KH) in which keywords at higher levels represent a larger number of other keywords and, as a consequence, are contained in a larger number of documents. The hierarchy is constructed over the keyword space by the Ward hierarchical clustering algorithm, guided by a keyword proximity measure that is well known in statistics as a measure of association between two categories: the Goodman-Kruskal τ. We then choose a representative keyword from each cluster of keywords in the hierarchy, in a way similar to how PageRank determines the authority of Web pages. The resulting hierarchy has the same descriptive and operational advantages as keyword indices, which partition a large document collection with respect to the search space of its contents. In addition, each cluster in the hierarchy is described by a list of frequently occurring terms ordered by their authority score as determined by PageRank. We performed experiments on a real case: the abstracts of the papers published in the last 8 years in the journal ACM Transactions on Database Systems, in which the papers have been manually classified into the ACM Computing Taxonomy (CT), whose categories were created by computer science experts. The obtained keyword hierarchy provides interesting insights into the documents' content and constitutes a useful tool for browsing the papers. We also evaluated the generated hierarchy objectively. We measured the correspondence between the clusters in KH and the categories in CT both by the Jaccard measure on the documents and by the entropy of the document categories within each cluster, and obtained good results. We also evaluated the ability of classifiers to assign documents to the categories of the two taxonomies, showing that classification is easier with KH than with CT.
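The three measures named in the abstract can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the representation of a keyword as a 0/1 occurrence vector over documents, and the use of base-2 logarithms are all assumptions made here. Note that the Goodman-Kruskal τ is asymmetric; the abstract does not say how it is turned into a proximity for Ward clustering, so that step is left out.

```python
from collections import Counter
from math import log2


def goodman_kruskal_tau(x, y):
    """Goodman-Kruskal tau(y | x): proportional reduction in the error of
    predicting category y once category x is known (asymmetric)."""
    n = len(x)
    if n == 0 or n != len(y):
        raise ValueError("x and y must be non-empty and of equal length")
    joint = Counter(zip(x, y))   # contingency counts n_ij
    row = Counter(x)             # marginal counts of x
    col = Counter(y)             # marginal counts of y
    # Error of predicting y without knowing x: 1 - sum_j p_j^2
    base = 1.0 - sum((c / n) ** 2 for c in col.values())
    if base == 0.0:
        return 0.0               # y is constant, nothing to explain
    # Error of predicting y given x: 1 - sum_ij p_ij^2 / p_i
    cond = 1.0 - sum(c * c / (row[xi] * n) for (xi, _), c in joint.items())
    return (base - cond) / base


def jaccard(a, b):
    """Jaccard measure between two sets of document identifiers."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def cluster_entropy(categories):
    """Entropy (in bits) of the category labels of the documents in one
    cluster; 0 means every document has the same category."""
    n = len(categories)
    counts = Counter(categories)
    return sum(-(c / n) * log2(c / n) for c in counts.values())


if __name__ == "__main__":
    # Hypothetical occurrence vectors of two keywords over four documents.
    kw_a = [1, 1, 0, 0]
    kw_b = [1, 1, 0, 0]
    print(goodman_kruskal_tau(kw_a, kw_b))
    print(jaccard({1, 2, 3}, {2, 3, 4}))
    print(cluster_entropy(["H.2", "H.2", "H.3", "H.3"]))
```

A τ of 1 means that knowing whether the first keyword occurs in a document fully determines whether the second occurs, and a τ of 0 means it gives no help. Lower cluster entropy and higher Jaccard values both indicate closer agreement between a KH cluster and a CT category.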
Similar resources
Comparing Conceptual, Divisive and Agglomerative Clustering for Learning Taxonomies from Text
The application of clustering methods for automatic taxonomy construction from text requires knowledge about the tradeoff between (i) their effectiveness (quality of result), (ii) efficiency (run-time behaviour), and (iii) traceability of the taxonomy construction by the ontology engineer. In this line, we present an original conceptual clustering method based on Formal Concept Analysis fo...
Towards (Semi-)automatic Generation of Bio-medical ontologies
The design and construction of domain specific ontologies and taxonomies requires allocation of huge resources in terms of cost and time. These efforts are human intensive and we need to explore ways of minimizing human involvement and other resources. In the biomedical domain, we seek to leverage resources such as the UMLS Metathesaurus and NLP-based applications such as MetaMap in conjunction...
Resolving Task Specification and Path Inconsistency in Taxonomy Construction
Taxonomies, such as Library of Congress Subject Headings and Open Directory Project, are widely used to support browsing-style information access in document collections. We call them browsing taxonomies. Most existing browsing taxonomies are manually constructed thus they could not easily adapt to arbitrary document collections. In this paper, we investigate both automatic and interactive tech...
Pattern-based automatic taxonomy learning from the Web
The construction of taxonomies is considered as the first step for structuring domain knowledge. Many methodologies have been developed in the past for building taxonomies from classical information repositories such as dictionaries, databases or domain text. However, in the last years, scientists have started to consider the Web as valuable repository of knowledge. In this paper we present a n...
Publication date: 2008